Now that we have calculated some summary statistics, we will want to work on visualizing these data in figures. There are many different and appropriate ways to visualize these types of data, but we will explore just a few in the course: bar plot, line graph, and scatter plot.
Here is a really nice website for making charts and graphs in R. It is broken down by types of data so may also be helpful when exploring your own data!
To practice making figures, we are going to again work with the
iris data set so go ahead and load that below:
Do you remember the structure of this data? Let’s review the structure below:
Great! So you know the Species is a factor
and that the other parameters are numeric and that you have
150 observations, or rows of data. Time to make out first figure!
To begin, we will create bar plots of the iris data to
visualize some of the parameters from the data set. Bar plots are useful
for depicting values of categorical variables, mean and standard errors,
and percentages. There are different ways to create figures in R,
however, the tidyverse provides a simple, yet highly
editable function for creating figures that we will use here:
ggplot2().
Go ahead and install and load the ggplot2() package
below:
As always, as you get more comfortable with R, you should explore the package documentation to better understand what you can do with the package, but for now we will just start with a basic bar plot.
While the package is called ggplot2(), the actual
function to create the plot is just ggplot(). If you just
run the function in your script, it will produce a blank space (and will
pop up under your Plots area in your R Studio window if
have it open).
The function ggplot() allows a lot of creativity for
creating figures, but most figures require the same basic calls within
the function to run: a data frame, axis variables (x, y, or both), and
what type of graph to produce. Again, if we just provide the data frame
(data = iris) and the aesthetics with the y axes
(aes(y = Species)), we now see the blank space populated
with a grey background, a x axis label of Species and
the three species names (also on the y axis), but no data is
plotted.
The reason we do not have any data plotted in the above space is
because we need to specify HOW to display the data. For example, to make
a simple bar plot to show the the number of counts of each
Species in the data frame, we need to tell
ggplot() we want to produce a bar plot
(geom_bar()) and we do that by adding another function to
the base ggplot() function like below:
We now have a bar graph that displays the Species of
iris on the x axis and bars that show the number of each
Species within the data frame along the y axis. This is a
great way to visualize the sample size, or N, of each
Species.
Bar plots are also useful to display summary statistics of data.
Based on what you learned in the previous tutorial, go ahead and
calculate the mean, sample size, standard deviation, and
standard error of the Sepal.Width per
Species using both summarySE() and
tidyverse below (name the summary data frame
sepal_length_summary):
## Load the package that contains the summarySE() function
## Load the tidyverse
## Calculate summary stats using summarySE()
## Calculate summary stats using the tidyverseDouble check that your summary statistics make sense, and then we can
create a new bar plot to show the mean Sepal.Width per
Species. This is slightly different than the count plot
code above.
data to specify the
sepal_length_summary data frame.mean_Sepal.Length.geom_bar() function, you want to
specify stat = "identity". This is the distinction that is
required to tell ggplot() that you will be supplying the y
values to plot instead of the count.ggplot(data = sepal_length_summary, aes(x = Species, y = mean_Sepal.Length)) +
geom_bar(stat = "identity")This is good, but lets add a more neat y axis label. We can update
the x and y labels by adding another line of code that calls the
function labs(). Here, we are only going to update the y
axis, but you can also use this to add a new x axis title if you
need.
ggplot(data = sepal_length_summary, aes(x = Species, y = mean_Sepal.Length)) +
geom_bar(stat = "identity") +
labs(y = "Mean sepal length")Next, we want to add error bars to each bar to depict the the
standard deviation from the mean. This time, we need to add a new
function to our figure code called geom_errorbar(). Within
that function, you again want to pass information to the function using
the aes() call.
ggplot(data = sepal_length_summary, aes(x = Species, y = mean_Sepal.Length)) +
geom_bar(stat = "identity") +
labs(y = "Mean sepal length") +
geom_errorbar(aes(ymin = mean_Sepal.Length - sd_Sepal.Length, ymax = mean_Sepal.Length + sd_Sepal.Length))Looks great! We can now see what kind of deviation around the mean
there is for each iris species. The last thing we will do here is change
the colours of the bars so that the each species is represented by a
different colour. This part of the code is added within the initial
ggplot() function in the aes() call. With a
bar plot, adding fill = Species will give each species name
a different colour (compare this with using
colour = Species which only changes the bar outline
colour).
ggplot(data = sepal_length_summary, aes(x = Species, y = mean_Sepal.Length, fill = Species)) +
geom_bar(stat = "identity") +
labs(y = "Mean sepal length") +
geom_errorbar(aes(ymin = mean_Sepal.Length - sd_Sepal.Length, ymax = mean_Sepal.Length + sd_Sepal.Length))We now have an informative bar plot that shows the mean sepal length
(mean_Sepal.Length) per Species with the
standard deviation, with different colours per Species.
This is a helpful resource for making and customizing barplots in ggplot and this is from the ggplot2 website.
While the above plots are informative ways to visualise many types of
data, these are not always the best plots for all data. For
example, if we want to know what the relationship between
Sepal.Length and Petal.Length of an individual
flower is, we might want to use a scatter plot to
visually represent that relationship. We again will use
ggplot() to produce this figure. Go ahead and start to make
a ggplot() figure with Sepal.Length along the
x axis and Petal.Length along the y axis. Also assign
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
labs(x = "Sepal length", y = "Petal length")Again, this should produce a blank plot with the appropriate x and y
axes, but without any data. To add the points to represent each sample
in a scatter plot, you will add the function geom_point()
just like we added geom_bar() for the bar plot
previously.
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
labs(x = "Sepal length", y = "Petal length") +
geom_point()Looks great! You can now see a general pattern that as the
Sepal.Length increases, you see a similar increase in the
Petal.Length. You may notice that in the bottom left of the
scatter plot, there is a group of points that do not conform to this
pattern. What could be causing it? Since we have combined data from
three different Species of flowers here, maybe that is
causing that pattern! So to look at how the relationship of
Sepal.Length and Petal.Length changes across
species, we can add a call to have each point a different colour based
on Species just like we did in the bar graph (except we can
just use colour = Species instead of
fill = Species in this specific case). Go ahead and add
that into the aes() portion of the ggplot()
function.
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
labs(x = "Sepal length", y = "Petal length") +
geom_point()Now we can clearly see that the setosa Species does not
fit the pattern of increasing Petal.Length along with
increasing Sepal.Length like the other two
Species. And we can actually quantify these differences by
looking at the slopes of the linear
regressions of each species! We will discuss performing a
linear regression in R below, but we can plot these regressions easily
with ggplot() to visualize those relationships. To add a
line to represent a simple linear regression to the data for each
species, we will add another function to the plot code called
geom_smooth(). This can add visual representations of
several different statistical models (read more about it here),
but for now we will focus on supplying it with the call
method = lm which will run a simple linear regression.
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
labs(x = "Sepal length", y = "Petal length") +
geom_point() +
geom_smooth(method = lm)## `geom_smooth()` using formula = 'y ~ x'
We now have three lines plotted on our three sets of species data! The line with the colour matching the points represented the linear model (think \(y = mx + b\); we will discuss this more below), while the grey area around the line (referred to often as a ribbon) represents the standard error (\(SE\)) of the regression.
NOTE: Because we specified that
colour = Species in the ggplot() aesthetics,
the function knew to assign a different linear regression per species,
hence, why there are three different lines matching the colours of the
species-specific data. If we wanted to calculate an overall linear
model, we would do that by specifying group = 1 in the
ggplot() aesthetics portion of the code like below:
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species, group = 1)) +
labs(x = "Sepal length", y = "Petal length") +
geom_point() +
geom_smooth(method = lm)## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
As we can see from the figure above, there is some sort of
relationship between the Sepal.Length and
Petal.Length of an iris flower. We can see that in the
scatter plot above, but how can we statistically test this relationship?
We can use a linear regression of the two continuous parameters with the
function lm() in R. By performing a simple linear
regression on these data, we will be able to calculate the slope (\(m\)) and intercept (\(b\)) of the line of best fit, report the p
value (or significance level) of the relationship, and also the
correlation coefficient (R2).
The basic structure of the lm() function requires us to
supply the data (data) and then specify the
response variable (y, this is generally
depicted on the y axis and is the variable we anticipated being
influenced by the predictor) and the predictor
variable(s), which is what we expect to change the response
variable (x), in the function below (You can read more
about the lm() function here.):
lm(y ~ x, data)
For the data we have been working with, the lm()
function will be written like shown below:
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Coefficients:
## (Intercept) Sepal.Length
## -7.101 1.858
After running the linear regression on these data, we see R has
produced an output that repeats the model we performed
(Call:), as well as produces two values listed under
Coefficients: (Intercept) and
Sepal.Length. Like the name suggests, the
(Intercept) provides the y intercept (\(b\)) of the linear regression (again,
remember \(y = mx + b\), where \(b\) represents the slope of the line). The
other variable provided, Sepal.Length, actually represents
the slope (\(m\)) of the line. So from
performing the linear regression, we are now able to report the equation
of the line that best represents the data like so:
\[y = 1.858433x + -7.101443\]
Now that you know the equation of the line, it is also good to look
at the summary of your linear regression. You can view the summary of a
linear regression model by placing the above lm() function
into the summary() function like shown below:
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47747 -0.59072 -0.00668 0.60484 2.49512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
## Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
We can see a lot of information about this linear regression now from
the summary() function. The summary is made up of a couple
different parts:
Call: This is the model being assessed by the
summary() functionResiduals: The section summarizes the residuals, the
error between the prediction of the model and the actual results.
Smaller residuals are better.Coefficients: For each variable and the intercept, a
weight is produced and that weight has other attributes like the
standard error, a t-test value and significance.
Performance Measures
For now, the information that we are most interested in from this list is the Adjusted R-squared (0.7583) value and the p-value (< 2.2e-16). Remember from the first tutorial, that the R2 values represents the proportion of the variance and that the closer R2 is to 1, the more strongly related the two variables are. From this R2, we can see about 76% of the sample variation is explained by the linear model. Finally, the p-value in the performance measures portion of the summary output represents the significance of the model. When the p-value is below 0.05, we can say that there is a significant relationship between the two variables.
Still confused about running a linear regression in R? Check out this video.
We have been working with the data already available in R, however, you are able to also analyze your own data that you can read into R. For example, your second R homework assignment will be working with the temperature and salinity data you have been collecting in class. You should have attempted to create an excel spreadsheet of these data, but now we will discuss how to organize the data in ways that will work for analysis in R. Additionally, it is MUCH easier to work with CSV files than XLS/XLSX files. We will do this in lab together and then you can proceed below once you have all your data in the .csv. file.
These data are now saved as a .csv file but can by uploaded into R
for assessment just like we did with the iris data. Before
you begin, make sure the file WaterData_spring24.csv is
saved in the same folder as this code (this is your working directory.
You can check your working directory through the user interface of R
Studio or by running the command getwd()). Once you ensure
the .csv file is there, you can run the following code to read in your
.csv data as a data frame and give it a name (here we will name it
water_data).
Couple things to note here: 1. Your file name needs
to be in quotation marks ("") or R will not be able to
identify it. 2. If your .csv file is NOT in the same directory
that you are working in, you can provide the direct path to it with
different folders being separated by / (like in this
example). 3. We have also specified header = TRUE in the
read.csv() function. This tells R that the first row of
each column in your .csv file corresponds to the column name or ID (aka,
the column header).
You can now use what you have learned previously to check out what
the data frame looks like using the head() function and
check on the structure (using str()). Do that below:
This R assignment will be due by the the start of next lab. You will submit both your R markdown code and the Word output of your work. This assignment can be completed with other members from your lab section but if you work with others, please include their names on your submission.
While you should all be familiar with the data you have been collecting in lab, let’s do a quick refresher! The data you are analyzing has been collected during Marine Biology lab this semester (Spring 2024) from two small aquariums full of salt water (see photo below). The two different tanks, labelled A and B, vary in measured temperature and salinity across the several time points of measurement. In your .csv data file, you should have the following information included: